This document presents Map 2: Thematic Structure of the DigiKat project. The core question driving this analysis is: What is Croatian Catholic digital media about, and who specializes in what?
We use topic modeling to discover the thematic landscape of the corpus and then examine how different actors and platforms specialize in particular topics.
1.1 Methodological Overview
Topic modeling is an unsupervised machine learning technique that discovers latent thematic patterns in large text corpora. Unlike manual content analysis, topic models can process hundreds of thousands of documents and identify recurring themes without predefined categories.
We employ Structural Topic Modeling (STM), an extension of Latent Dirichlet Allocation (LDA) that incorporates document metadata (actor type, platform, date) as covariates. This allows us to examine not only what topics exist but also how topic prevalence varies across different communicators and platforms.
Key concepts:
Topic: A probability distribution over words. Topics represent coherent themes (e.g., liturgical practice, social issues, Vatican news)
Document-topic distribution (theta): Each document is a mixture of topics. A post might be 60% liturgical and 40% devotional
Topic prevalence: The average proportion of the corpus devoted to each topic
FREX words: Words that are both frequent AND exclusive to a topic (helps distinguish topics)
Interpretation guidance:
High prevalence topics represent dominant themes in the discourse
Topics with high FREX distinctiveness are easier to interpret
Platform-topic patterns show how different media afford different content types
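To make these quantities concrete, here is a minimal sketch in Python with invented numbers (nothing below comes from the DigiKat corpus): each row of theta is one document's topic mixture, and prevalence is the column mean.

```python
# Toy document-topic matrix theta: 4 documents x 3 topics (invented values).
# Each row sums to 1 -- every document is a mixture of topics.
theta = [
    [0.6, 0.4, 0.0],  # e.g., 60% liturgical, 40% devotional
    [0.1, 0.2, 0.7],
    [0.5, 0.3, 0.2],
    [0.2, 0.1, 0.7],
]
for row in theta:
    assert abs(sum(row) - 1.0) < 1e-9

# Topic prevalence: the average proportion of the corpus devoted to each
# topic, i.e. the column means of theta.
n_docs, n_topics = len(theta), len(theta[0])
prevalence = [sum(doc[k] for doc in theta) / n_docs for k in range(n_topics)]
print([round(p, 3) for p in prevalence])  # [0.35, 0.25, 0.4]
```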
Show code
dta <- readRDS("C:/Users/lsikic/Luka C/HKS/Projekti/Digitalni Kat/SHKM/DigiKat/data/merged_comprehensive.rds") %>%
  filter(SOURCE_TYPE != "tiktok", !is.na(SOURCE_TYPE)) %>%
  filter(DATE >= as.Date("2021-01-01") & DATE <= as.Date("2025-12-31")) %>%
  filter(year >= 2021 & year <= 2025)

setDT(dta)

# Check data loaded correctly
cat("Data loaded:", nrow(dta), "rows\n")
Data loaded: 608879 rows
Show code
if (nrow(dta) == 0) stop("No data after filtering! Check your date filters.")

n_posts <- nrow(dta)
n_with_text <- sum(!is.na(dta$FULL_TEXT) & nchar(dta$FULL_TEXT) > 50, na.rm = TRUE)
Before topic modeling, text must be cleaned and tokenized. Preprocessing involves:
URL removal: Eliminate web addresses that add noise
Lowercasing: Standardize case for consistent matching
Punctuation/number removal: Keep only alphabetic content
Stopword removal: Filter common words that don't carry topical meaning
Minimum character filter: Remove very short tokens (< 3 characters)
Frequency trimming: Keep only terms appearing in multiple documents
Why stopwords matter: Croatian has rich morphology, so stopwords must cover all declensions and conjugations. The list includes auxiliaries (sam, si, je), prepositions (u, na, za), pronouns (ja, ti, on), and generic terms that appear across all topics.
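The steps above can be sketched in Python (the actual pipeline runs in R; the stopword set and regex here are illustrative placeholders, not the real Croatian list):

```python
import re

# Illustrative stopword fragment only; the real Croatian list must cover
# declensions and conjugations (sam/si/je, u/na/za, ja/ti/on, ...).
STOPWORDS = {"sam", "si", "je", "u", "na", "za", "ja", "ti", "on", "i", "se"}

def preprocess(text, min_chars=3):
    text = re.sub(r"https?://\S+", " ", text)    # 1. URL removal
    text = text.lower()                          # 2. lowercasing
    text = re.sub(r"[^a-zčćđšž\s]", " ", text)   # 3. drop punctuation/numbers
    tokens = [t for t in text.split() if t not in STOPWORDS]  # 4. stopwords
    tokens = [t for t in tokens if len(t) >= min_chars]       # 5. short tokens
    # 6. Frequency trimming happens at corpus level, not per document.
    return tokens

print(preprocess("Misa je u 10h: https://example.com SRETAN USKRS!"))
# ['misa', 'sretan', 'uskrs']
```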
# Safety checks before running STM
if (ndoc(dfm_cath) == 0) {
  stop("DFM has 0 documents! Check your data loading and filtering steps.")
}
if (nfeat(dfm_cath) == 0) {
  stop("DFM has 0 features! Check your text preprocessing and trimming parameters.")
}
if (ndoc(dfm_cath) < 100) {
  warning("Very few documents (", ndoc(dfm_cath), "). Results may be unreliable.")
}

stm_dfm <- convert(dfm_cath, to = "stm")

# Verify conversion worked
if (length(stm_dfm$documents) == 0) {
  stop("STM conversion resulted in 0 documents!")
}

# Use NUM_TOPICS from config (set at top of file)
K <- NUM_TOPICS

# Adjust K if we have too few documents
if (length(stm_dfm$documents) < K * 3) {
  K <- max(5, floor(length(stm_dfm$documents) / 3))
  warning("Reduced K to ", K, " due to small corpus size")
}

stm_model <- stm(
  documents = stm_dfm$documents,
  vocab = stm_dfm$vocab,
  K = K,
  prevalence = ~ actor_type + platform,
  data = stm_dfm$meta,
  init.type = "Spectral",
  max.em.its = 50,
  verbose = FALSE,
  seed = 42
)

cat("STM model fitted with", K, "topics\n")
STM model fitted with 35 topics
2.2 Topic Labels and Top Words
Each topic is characterized by words that appear frequently within it. We present two types of word rankings:
Top Words (Probability): Words with highest probability of appearing in documents assigned to this topic. These are the most common terms but may appear across multiple topics.
FREX Words (Distinctive): Words that are both frequent AND exclusive to this topic. FREX (Frequency and Exclusivity) scoring helps identify what makes each topic unique. These are typically more useful for interpreting topic content.
How to read the table:
Look at FREX words first to understand what distinguishes each topic
Topics with clear thematic coherence (e.g., all words related to liturgy) are well defined
Topics with mixed words may capture multiple related themes or require further refinement
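A sketch of FREX scoring, assuming the standard stm definition (a weighted harmonic mean of the within-topic frequency ECDF and the exclusivity ECDF, default weight 0.5); the beta values are invented:

```python
# Invented topic-word probabilities beta: 2 topics x 4 words (rows sum to 1).
vocab = ["misa", "molitva", "biskup", "crkva"]
beta = [
    [0.40, 0.10, 0.10, 0.40],  # topic 0
    [0.05, 0.55, 0.30, 0.10],  # topic 1
]

def ecdf(value, values):
    """Fraction of values <= value (empirical CDF)."""
    return sum(v <= value for v in values) / len(values)

def exclusivity(topic, j):
    # This topic's share of the word's total probability mass across topics.
    return beta[topic][j] / sum(beta[t][j] for t in range(len(beta)))

def frex(topic, j, weight=0.5):
    f = ecdf(beta[topic][j], beta[topic])
    e = ecdf(exclusivity(topic, j),
             [exclusivity(topic, k) for k in range(len(vocab))])
    # Weighted harmonic mean of the frequency and exclusivity ECDFs.
    return 1.0 / (weight / f + (1 - weight) / e)

ranked = sorted(range(len(vocab)), key=lambda j: -frex(1, j))
print([vocab[j] for j in ranked])  # ['molitva', 'biskup', 'crkva', 'misa']
```

Note how "misa" ranks last for topic 1 despite appearing there: it is neither frequent in the topic nor exclusive to it.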
Topic prevalence measures what proportion of the corpus is devoted to each topic. In STM, each document is modeled as a mixture of topics, so prevalence represents the average proportion across all documents.
Interpretation:
Higher prevalence topics represent dominant themes in the corpus
The distribution typically follows a long tail pattern, with a few dominant topics and many niche ones
Prevalence alone does not indicate importance; niche topics may be highly significant for specific actor types or contexts
par(mfrow = c(5, 7), mar = c(0.5, 0.5, 1.5, 0.5))
for (i in 1:K) {
  cloud(stm_model, topic = i, max.words = 30, colors = brewer.pal(8, "Dark2"),
        main = paste("Topic", i))
}
Show code
par(mfrow =c(1, 1))
3 Analysis 2.2: Topic Taxonomy
Group discovered topics into higher-order thematic categories based on manual inspection of top words.
3.1 Automated Category Assignment
To organize the 35 discovered topics into meaningful higher-level categories, we use keyword matching against predefined thematic dictionaries. Each topic is assigned to the category with the most keyword matches among its top and FREX words.
Category definitions:
Category                 Description                                  Example keywords
Liturgical/Sacramental   Mass, sacraments, liturgical calendar        misa, sakrament, euharistij, uskrs
Devotional               Prayer, saints, personal piety               molitva, gospa, svetac, duhov
Institutional            Church hierarchy, appointments, governance   biskup, papa, vatikan, imenovan
Social/Ethical           Family, charity, social issues               obitelj, caritas, siromašn, brak
Political                Politics, history, national identity         vlada, hrvat, rat, domovina
Youth/Community          Youth ministry, community activities         mladi, frama, kamp, zajednic
Educational              Catechesis, theology, learning               kateh, teolog, biblij, učenj
Topics that do not match any category are labeled "Other". This automatic classification provides a starting point; manual refinement may be needed for edge cases.
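The assignment rule described above can be sketched in Python (the dictionaries here are hypothetical stem fragments abbreviated from the table; the real implementation is in R):

```python
# Hypothetical category dictionaries (stem fragments, abbreviated from the table).
categories = {
    "Liturgical/Sacramental": ["misa", "sakrament", "euharistij", "uskrs"],
    "Devotional": ["molitva", "gospa", "svetac", "duhov"],
    "Institutional": ["biskup", "papa", "vatikan", "imenovan"],
}

def assign_category(topic_words):
    # Count how many of the topic's top/FREX words start with a category stem.
    scores = {cat: sum(any(w.startswith(stem) for stem in stems)
                       for w in topic_words)
              for cat, stems in categories.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "Other"

print(assign_category(["molitva", "gospa", "krunica", "duhovnost"]))  # Devotional
print(assign_category(["nogomet", "utakmica"]))                       # Other
```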
This section examines how different actor types specialize in particular topics.
4.1 Understanding Actor-Topic Relationships
Different actors in the Catholic digital space have distinct communication priorities. Official institutions may focus on announcements and governance, while charismatic communities emphasize devotional content. This analysis reveals these specialization patterns.
Key concepts:
Topic proportion by actor: The average topic distribution for documents from each actor type
Specialization (Lift): How much more (or less) likely an actor type is to discuss a topic compared to the corpus average
Lift interpretation:
Lift = 1.0: Actor discusses this topic at the same rate as the overall corpus
Lift > 1.0: Actor specializes in this topic (e.g., Lift = 2.0 means twice as likely)
Lift < 1.0: Actor underrepresents this topic compared to average
This heatmap visualizes the relationship between actor types (columns) and topics (rows). Darker colors indicate higher proportions of that topic in an actor type's content.
How to read the heatmap:
Each cell shows what percentage of an actor type's content belongs to a topic
Rows are clustered by similarity, so related topics appear near each other
Columns are also clustered, revealing which actor types have similar thematic profiles
Look for dark cells to identify specializations; look for light/missing cells to identify topics an actor avoids
We calculate how specialized each actor type is relative to the corpus average.
The Lift metric provides a precise measure of specialization. For each actor-topic pair, we calculate:
\[\text{Lift} = \frac{\text{Actor's topic proportion}}{\text{Corpus average topic proportion}}\]
This table shows the top 3 topics where each actor type most strongly overindexes. A high lift value (e.g., 3.0x) indicates the actor discusses this topic three times more often than average, suggesting it is central to their communication strategy.
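The lift calculation itself is a single division; a toy Python example with invented proportions:

```python
# Invented topic proportions: corpus average vs. one hypothetical actor type.
corpus_avg = {"liturgical": 0.20, "devotional": 0.10, "political": 0.05}
actor_prop = {"liturgical": 0.20, "devotional": 0.30, "political": 0.02}

# Lift = actor's topic proportion / corpus average topic proportion.
lift = {t: round(actor_prop[t] / corpus_avg[t], 2) for t in corpus_avg}
print(lift)  # {'liturgical': 1.0, 'devotional': 3.0, 'political': 0.4}
```

Here the hypothetical actor overindexes on devotional content (3.0x) and underrepresents political content (0.4x).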
category_by_actor <- specialization_df %>%
  group_by(actor_type, Category) %>%
  summarise(Category_Proportion = sum(Proportion), .groups = "drop")

ggplot(category_by_actor, aes(x = Category, y = Category_Proportion, fill = Category)) +
  geom_col(width = 0.7) +
  facet_wrap(~actor_type, scales = "free_y", ncol = 3) +
  scale_fill_brewer(palette = "Set2") +
  scale_y_continuous(labels = function(x) paste0(round(x * 100, 1), "%")) +
  labs(
    title = "Thematic Category Profiles by Actor Type",
    subtitle = "What each actor type talks about",
    x = NULL,
    y = "Proportion of Content"
  ) +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1, size = 8),
    legend.position = "none",
    strip.text = element_text(face = "bold")
  )
5 Analysis 2.4: Topic by Platform Mapping
How do topics vary across digital platforms?
Different platforms have distinct affordances that shape communication: the web allows long-form content, Facebook enables community discussion, Instagram favors visual storytelling, and YouTube supports video content. These technical characteristics influence what topics appear where.
Expected patterns:
Web: More institutional, news-oriented, and educational content (supports detailed text)
Social media (Facebook, Instagram): More devotional and community content (personal, shareable)
YouTube: Mix of educational and devotional (video catechesis, prayer channels)
Comments/Forums: More reactive content, discussion of controversial topics
This heatmap uses a warm color palette (white to orange to red) to distinguish it from the actor heatmap. The same interpretation applies: darker colors indicate higher topic prevalence on that platform.
This analysis addresses a key hypothesis: that certain types of content (particularly political and controversial topics) generate disproportionately high engagement compared to their prevalence. This pattern, common in digital media, reflects the attention economy dynamics where provocative content outperforms routine communication.
Key metrics:
Weighted Mean Engagement: Average interactions per document, weighted by topic proportion. A document that is 80% about Topic X contributes more to Topic X's engagement score than one that is 10% Topic X.
Total Engagement: Sum of all interactions attributed to a topic (useful for absolute comparison)
Engagement Efficiency: Topics above the trend line in the scatterplot punch above their weight, generating more engagement than their prevalence would predict
Interpretation:
Topics with high engagement but low prevalence represent niche but highly resonant content
Topics below the trend line may indicate routine, lower-engagement content (e.g., announcements, schedules)
Show code
doc_engagement <- dta_sample[, .(doc_id, INTERACTIONS, REACH)]

# Create row indices for joining
doc_meta_with_id <- doc_meta |> mutate(row_id = row_number())
doc_engagement_with_id <- doc_engagement |> mutate(row_id = row_number())

# Join by row_id since both datasets have the same order
doc_meta_eng <- doc_meta_with_id |>
  left_join(doc_engagement_with_id, by = "row_id", suffix = c("", "_y"))

# Now proceed with topic engagement calculation
topic_engagement <- lapply(1:K, function(t) {
  topic_col <- paste0("Topic_", t)
  doc_meta_eng |>
    mutate(weight = .data[[topic_col]]) |>
    summarise(
      Topic = t,
      Weighted_Mean_Engagement = weighted.mean(INTERACTIONS, weight, na.rm = TRUE),
      Mean_Engagement = mean(INTERACTIONS[.data[[topic_col]] > 0.1], na.rm = TRUE),
      Total_Engagement = sum(INTERACTIONS * weight, na.rm = TRUE),
      Document_Count = sum(.data[[topic_col]] > 0.1)
    )
}) |> bind_rows()

topic_engagement <- topic_engagement |>
  left_join(topic_taxonomy, by = "Topic") |>
  left_join(topic_prev_df |> mutate(Topic = as.numeric(as.character(Topic))), by = "Topic")
6.1 Topic Engagement Ranking
Show code
ggplot(topic_engagement, aes(
  x = reorder(factor(Topic), Weighted_Mean_Engagement),
  y = Weighted_Mean_Engagement, fill = Category
)) +
  geom_col(width = 0.7) +
  geom_text(aes(label = round(Weighted_Mean_Engagement, 0)), hjust = -0.1, size = 3) +
  coord_flip() +
  scale_fill_brewer(palette = "Set2") +
  scale_y_continuous(limits = c(0, max(topic_engagement$Weighted_Mean_Engagement, na.rm = TRUE) * 1.15)) +
  labs(
    title = "Average Engagement by Topic",
    subtitle = "Weighted mean interactions per document",
    x = "Topic",
    y = "Weighted Mean Engagement",
    fill = "Category"
  )
6.2 Volume vs Engagement Comparison
This scatterplot is one of the most important visualizations in the analysis. Each point represents a topic, positioned by its prevalence (x axis) and average engagement (y axis).
How to interpret:
Trend line (dashed): Shows the expected relationship between prevalence and engagement
Points above the line: Topics that generate more engagement than expected given their prevalence (engagement overperformers)
Points below the line: Topics that underperform in engagement relative to their volume
Point size: Indicates total engagement volume
This visualization directly tests the hypothesis that certain content types (often political or controversial) punch above their weight in the attention economy.
Show code
topic_engagement <- topic_engagement %>%
  mutate(
    Prevalence_Rank = rank(-Prevalence),
    Engagement_Rank = rank(-Weighted_Mean_Engagement),
    Engagement_Efficiency = Engagement_Rank - Prevalence_Rank
  )

ggplot(topic_engagement, aes(x = Prevalence, y = Weighted_Mean_Engagement)) +
  geom_point(aes(color = Category, size = Total_Engagement), alpha = 0.7) +
  geom_text_repel(aes(label = Topic), size = 3, max.overlaps = 15) +
  geom_smooth(method = "lm", se = TRUE, linetype = "dashed", color = "gray50") +
  scale_color_brewer(palette = "Set2") +
  scale_size_continuous(range = c(3, 12), labels = comma) +
  labs(
    title = "Topic Prevalence vs Engagement",
    subtitle = "Points above the trend line are engagement overperformers",
    x = "Prevalence (%)",
    y = "Weighted Mean Engagement",
    color = "Category",
    size = "Total Engagement"
  )
Topics in the Political and Social/Ethical categories often generate disproportionately high engagement relative to their prevalence, supporting the hypothesis that controversial content attracts more attention.
7 Analysis 2.6: Temporal Topic Trends
How do topics evolve over time?
Understanding temporal dynamics reveals how the thematic composition of Catholic digital media shifts in response to events, seasons, and longer term trends. This analysis aggregates topic prevalence by year to identify patterns of growth and decline.
What to look for:
Seasonal patterns: Liturgical topics may spike around major feasts (Christmas, Easter)
Event-driven changes: Political topics may surge during elections or controversies
Structural shifts: Long term trends in what the Catholic digital space discusses
Interpretation caveats:
Changes between years may reflect genuine shifts OR changes in sampling/sources
Short time series (4 years) limits our ability to distinguish trends from fluctuations
Stacked area charts show relative proportions; a decline in share does not necessarily mean absolute decline
This analysis identifies which specific topics have increased or decreased in prevalence over the study period. We compare the first and last year in the dataset to calculate absolute and percentage change.
Interpretation:
Growing topics: Themes receiving increasing attention in the Catholic digital space
Declining topics: Themes that have become less prominent over time
Percentage change is most meaningful for topics with substantial initial prevalence; small topics may show large percentage swings with minimal absolute change
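The first-vs-last-year comparison reduces to two simple quantities (the prevalence values below are hypothetical):

```python
# Hypothetical prevalence for one topic in the first and last year.
prevalence_by_year = {2021: 0.040, 2025: 0.060}

first, last = prevalence_by_year[2021], prevalence_by_year[2025]
abs_change = last - first                  # absolute change in prevalence
pct_change = (last - first) / first * 100  # percentage change vs. first year
print(round(abs_change, 3), round(pct_change, 1))  # 0.02 50.0
```

A topic starting at 0.4% prevalence would show the same +50% with an absolute change of only 0.002, which is why percentage change alone can mislead for small topics.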
9.2.1 1. BERTopic
BERTopic uses pretrained language models to create semantically meaningful embeddings before clustering.
Advantages:
Captures semantic similarity beyond word co-occurrence
Handles synonyms and related concepts automatically
Produces more coherent, interpretable topics
Does not require extensive stopword lists
Implementation in R/Python:
# Python example (requires the bertopic package)
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer

# Use a multilingual model for Croatian
embedding_model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
topic_model = BERTopic(embedding_model=embedding_model, language="multilingual")
topics, probs = topic_model.fit_transform(documents)
9.2.2 2. Multilingual Language Models
For Croatian text, several pretrained models are available:
Model                            Description                                    Use case
mBERT (multilingual BERT)        Trained on 104 languages including Croatian    General embeddings
XLM-RoBERTa                      Strong multilingual performance                Semantic similarity
CroSloEngual BERT                Specialized for Croatian, Slovenian, English   Best for Croatian
paraphrase-multilingual-MiniLM   Efficient, good for clustering                 BERTopic embeddings
9.2.3 3. Zero-shot Classification
Instead of discovering topics, assign documents to predefined categories using language models:
9.2.4 4. Named Entity Recognition
Extract structured information (people, organizations, places) using:
classla (for Slovenian and Croatian NLP)
stanza with Croatian models
spaCy with custom Croatian models
9.2.5 5. Semantic Search and Clustering
Modern approach workflow:
1. Embed all documents using multilingual transformers
2. Reduce dimensions with UMAP
3. Cluster with HDBSCAN
4. Extract topic labels using c-TF-IDF or LLM summarization
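Step 4's c-TF-IDF can be sketched in pure Python, assuming BERTopic's published formula tf(w, c) · log(1 + A / f(w)), where A is the average word count per cluster and f(w) is the word's corpus frequency; the counts are invented:

```python
import math

# Invented word counts per cluster (rows = clusters, columns = words).
counts = [
    [8, 1, 1],  # cluster A
    [1, 6, 3],  # cluster B
]

def ctfidf(counts):
    # Class-based TF-IDF: within-cluster term frequency, scaled by
    # log(1 + A / f_w), A = average words per cluster, f_w = corpus frequency.
    avg_words = sum(map(sum, counts)) / len(counts)
    n_words = len(counts[0])
    corpus_freq = [sum(row[j] for row in counts) for j in range(n_words)]
    return [[(row[j] / sum(row)) * math.log(1 + avg_words / corpus_freq[j])
             for j in range(n_words)]
            for row in counts]

scores = ctfidf(counts)
# The highest-scoring word index labels each cluster.
print([max(range(len(s)), key=lambda j: s[j]) for s in scores])  # [0, 1]
```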
9.3 Recommended Next Steps
Short term: Refine STM stopwords and experiment with different K values
Medium term: Implement BERTopic with multilingual embeddings for comparison
Long term: Build custom Croatian language model fine-tuned on religious text